摘要 :
High utility itemset mining considers unit profits and quantities of items in a transaction database to extract more applicable and more useful association rules. Downward closure property, which causes significant pruning in freq...
展开
High utility itemset mining considers unit profits and quantities of items in a transaction database to extract more applicable and more useful association rules. Downward closure property, which causes significant pruning in frequent itemset mining, is not established in the utility of itemsets and so the mining problem will require alternative solutions to reduce its search space and to enhance its efficiency. Using an anti-monotonic upper bound of the utility function and exploiting efficient data structures for storing and compacting the dataset to perform efficient pruning strategies are the main solutions to address high utility itemset mining problem. Different mining methods and techniques have attempted to improve performance of extracting high utility itemsets and their several variants, including high-average utility itemsets, top-k high utility itemsets, and high utility itemsets with negative values, using more efficient data structures, more appropriate anti-monotonic upper bounds, and stronger pruning strategies. This paper aims to represent a comprehensive systematic review for high utility itemset mining techniques and to classify them based on their problem-solving approaches.
收起
摘要 :
Erasable itemset (EI) mining is an interesting variation of frequent itemset mining which allows managers to carefully consider their production plans to ensure the stability of the factory. Existing algorithms for EI mining requi...
展开
Erasable itemset (EI) mining is an interesting variation of frequent itemset mining which allows managers to carefully consider their production plans to ensure the stability of the factory. Existing algorithms for EI mining require a lot of time and memory. This paper proposes an effective algorithm, called mining erasable itemsets (MEI), which uses the divide-and-conquer strategy and the difference pidset (dPidset) concept for mining EIs fully. Some theorems for efficiently computing itemset information to reduce mining time up and memory usage are also derived. Experimental results show that MEI outperforms existing approaches in terms of both the mining time and memory usage. Moreover, the proposed algorithm is capable of mining EIs with higher thresholds than those obtained using existing approaches.
收起
摘要 :
Itemset mining is a popular extension to the frequent pattern mining problem in data mining. Finding infrequent patterns, however, has gained its importance due to proven utility in the areas of web mining, bioinformatics and othe...
展开
Itemset mining is a popular extension to the frequent pattern mining problem in data mining. Finding infrequent patterns, however, has gained its importance due to proven utility in the areas of web mining, bioinformatics and others. High utility mining refines the problem focus to identifying business-relevant transaction patterns that take purchase quantities and monetary considerations into account, like unit price and cost, typically to identify patterns of profit potential. High utility infrequent itemset mining unveils rare cases of highly profitable itemsets. This paper proposes a customized Ant colony algorithm for the efficient discovery of high utility infrequent itemsets. The mining performance of proposed algorithm is analyzed on four real time datasets namely chess, food mart, mushroom and retail.
收起
摘要 :
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HU...
展开
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), concise and lossless representations of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task. This is because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm utilizes the utility-list structure to directly compute the utilities of itemsets without producing candidates. It also introduces three novel strategies to reduce the search space, namely chain-estimated utility co-occurrence pruning, lower branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required for discovering CHUIs. To evaluate the performance of the proposed algorithm and its novel strategies, extensive experiments have been conducted on six benchmark datasets having various characteristics. Results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithm outperforms the current state-of-the-art CHUD and CHUI-Miner algorithms, and that CLS-Miner scales linearly.
收起
摘要 :
The problem of discovering high-utility itemsets (HUIs) in transaction databases, which is an extension of Frequent Itemset Mining, is a commonly encountered mining task. Researchers have proposed algorithms to efficiently mine hi...
展开
The problem of discovering high-utility itemsets (HUIs) in transaction databases, which is an extension of Frequent Itemset Mining, is a commonly encountered mining task. Researchers have proposed algorithms to efficiently mine highly profitable itemsets in customer transaction databases, in which the unit profits of items are fixed. However, this assumption does not reflect the true nature of the utility measure of items in real-life transaction databases, which might vary over time. Moreover, since this important characteristic is ignored by all the current HUI mining algorithms, they are either not applicable to this type of database, or they generate inaccurate results. In addition, the HUI mining algorithms' traditional limitation is that they produce a huge number of HUIs for users. In this paper, we define the problem of mining a lossless, concise and compact representation of HUIs, called closed HUIs (CHUIs), in dynamic unit profit databases. Based on newly defined of utility measure, a novel algorithm, called iEFIM-Closed, is introduced. This relies on this new utility measure, a novel compact database format to reduce the cost of database scans and increase the efficiency of the mining process. Experimental evaluations show that iEFIM-Closed significantly outperforms state-of-the-art algorithms for mining CHUIs on sparse databases with dynamic profit, and has competitive performance in dense databases in terms of runtime, the cost of database scans and scalability.
收起
摘要 :
Privacy preserving data mining (PPDM) is the process of protecting sensitive knowledge from being discovered by data mining techniques in case of data sharing. Privacy preserving frequent itemset mining (PPFIM) is a subtask and NP...
展开
Privacy preserving data mining (PPDM) is the process of protecting sensitive knowledge from being discovered by data mining techniques in case of data sharing. Privacy preserving frequent itemset mining (PPFIM) is a subtask and NP-hard problem of PPDM. Its objective is to modify a given database in such a way that none of the sensitive itemsets of the database owner can be obtained by any frequent itemset mining technique from the modified database. The main challenge of PPFIM is to minimize the distortion given to the data and nonsensitive knowledge while sanitizing all given sensitive itemsets. Distortion-based sensitive itemset hiding algorithms decrease the support of each sensitive itemset under a predefined sensitive threshold through sanitization. Most of the distortion-based itemset hiding algorithms allow database owner to define a single sensitive threshold for each sensitive itemset. However, this is a limitation to the database owner since the importance of each sensitive itemset varies. In this paper we propose a distortion-based itemset hiding algorithm that allows database owner to assign multiple sensitive thresholds, namely itemset oriented pseudo graph based sanitization (IPGBS) algorithm. The purpose of IPGBS algorithm is to give minimum distortion to the nonsensitive knowledge and data while hiding all sensitive itemsets. For this reason, the IPGBS algorithm modifies least amount of transaction and transaction content. The performance evaluation of the IPGBS algorithm is conducted by using two different counterparts on four different databases. The results show that the IPGBS algorithm is more efficient in terms of nonsensitive frequent itemset loss on both dense and sparse databases. It has considerable good results in terms of number of transactions modified, number of items deleted, execution time and total memory allocation as well.
收起
摘要 :
Utility itemsets mining is an extension of frequent itemsets mining, considering both the quantities of items in the transactions and the profits of the items. In traditional utility itemsets mining, the utility of an itemset is t...
展开
Utility itemsets mining is an extension of frequent itemsets mining, considering both the quantities of items in the transactions and the profits of the items. In traditional utility itemsets mining, the utility of an itemset is the summation of the utilities of the itemset in all the transactions containing this itemset while ignoring the length of the itemset and will increase along with an increase in its length. To eliminate the effect of the length of an itemset, average utility measurement was proposed, and average utility of an itemset was defined as the total utility of the itemset divided by its length. In this article, we propose a pattern-wise algorithm to mine high average utility itemsets. Experimental results show that the proposed algorithm outperforms the existing HAUP-mine algorithm.
收起
摘要 :
Mining itemsets for association rule generation is a fundamental data mining task originally stemming from the traditional market basket analysis problem. However, enumerating all frequent itemsets, especially in a dense dataset, ...
展开
Mining itemsets for association rule generation is a fundamental data mining task originally stemming from the traditional market basket analysis problem. However, enumerating all frequent itemsets, especially in a dense dataset, or with low support thresholds, remains costly. In this paper, a novel theorem builds the relationship between frequent closed itemsets (FCIs) and frequent generator itemsets (FGIs) and proves that the process of mining FCIs is equivalent to mining FGIs, unified with their full-support and extension items. On the basis of this theorem, a generator-based algorithm for mining FCIs, called GrAFCI(+), is proposed and explained in details including its correctness. The comparative effectiveness of the algorithm in terms of scalability is first investigated, along with the compression rate-a measure of the interestingness of a given FIs representation. Extensive experiments are further undertaken on eight datasets and four state-of-the-art algorithms, namely DCI_CLOSED*, DCI_PLUS, FPClose, and NAFCP. The results show that the proposed algorithm is more efficient regarding the execution time in most cases as compared to these algorithms. Because GrAFCI(+) main goal is to address the runtime issue, it paid a memory cost, especially when the support is too small. However, this cost is not high since GrAFCI(+) is seconded by only one competitor out of four in memory utilization and for large support values. As an overall assessment, GrAFCI(+) gives better results than most of its competitors.
收起
摘要 :
High utility itemsets mining is a subfield of data mining with wide applications. Although the existing high utility itemsets mining algorithms can discover all the itemsets satisfying a given minimum utility threshold, it is ofte...
展开
High utility itemsets mining is a subfield of data mining with wide applications. Although the existing high utility itemsets mining algorithms can discover all the itemsets satisfying a given minimum utility threshold, it is often difficult for users to set a proper minimum utility threshold. A smaller minimum utility threshold value may produce a huge number of itemsets, whereas a higher one may produce a few itemsets. Specification of minimum utility threshold is difficult and time-consuming. To address these issues, top-k high utility itemsets mining has been defined where k is the number of high utility itemsets to be found. In this paper, we present an efficient algorithm (named TKEH) for finding top-k high utility itemsets. TKEH utilizes transaction merging and dataset projection techniques to reduce the dataset scanning cost. These techniques reduce the dataset when larger items are explored. TKEH employs three minimum utility threshold raising strategies. We utilize two strategies to prune search space efficiently. To calculate the utility of items and upper-bounds in linear time, TKEH utilizes array-based utility technique. We carried out some extensive experiments on real datasets. The results show that TKEH outperforms the state-of-the-art algorithms. Moreover, TKEH always performs better for dense datasets.
收起
摘要 :
Itemset mining looks for correlations among data items in large transactional datasets. Traditional in-core mining algorithms do not scale well with huge data volumes, and are hindered by critical issues such as long execution tim...
展开
Itemset mining looks for correlations among data items in large transactional datasets. Traditional in-core mining algorithms do not scale well with huge data volumes, and are hindered by critical issues such as long execution times due to massive memory swap and main-memory exhaustion. This work is aimed at overcoming the scalability issues of existing in-core algorithms by improving their memory usage. A persistent structure, VLDBMine, to compactly store huge transactional datasets on disk and efficiently support large-scale itemset mining is proposed. VLDBMine provides a compact and complete representation of the data, by exploiting two different data structures suitable for diverse data distributions, and includes an appropriate indexing structure, allowing selective data retrieval. Experimental validation, performed on both real and synthetic datasets, shows the compactness of the VLDBMine data structure and the efficiency and scalability on large datasets of the mining algorithms supported by it. (C) 2014 Elsevier Inc. All rights reserved.
收起